Branching on Attribute Values in Decision Tree Generation
Author
Abstract
The problem of deciding which subset of values of a categorical-valued attribute to branch on during decision tree generation is addressed. Algorithms such as ID3 and C4 do not address the issue and simply branch on each value of the selected attribute. The GID3* algorithm is presented and evaluated. GID3* is a generalized version of Quinlan's ID3 and C4, and is a non-parametric version of the GID3 algorithm presented in an earlier paper. It branches on a subset of individual values of an attribute, while grouping the rest under a single DEFAULT branch. It is empirically demonstrated that GID3* outperforms ID3 (C4) and GID3 for any parameter setting of the latter. The empirical tests include both controlled synthetic (randomized) domains and real-world data sets. The improvement in tree quality, as measured by the number of leaves and the estimated error rate, is significant.

Introduction

Empirical learning algorithms attempt to discover relations between situations expressed in terms of a set of attributes and actions encoded in terms of a fixed set of classes. By examining large sets of pre-classified data, it is hoped that a learning program may discover the proper conditions under which each action (class) is appropriate. Heuristic methods are used to perform guided search through the large space of possible relations between combinations of attribute values and classes. A powerful and popular such heuristic uses the notion of selecting attributes that locally minimize the information entropy of the classes in a data set. This heuristic is used in the ID3 algorithm [11] and its extensions, e.g. GID3 [2], GID3* [4], and C4 [12], in CART [1], in CN2 [3], and others; see [4, 5, 10] for a general discussion of the attribute selection problem.

The attributes in a learning problem may be discrete (categorical), or they may be continuous (numerical). The above-mentioned attribute selection process assumes that all attributes are discrete. Continuous-valued attributes must therefore be discretized prior to attribute selection. This is typically achieved by partitioning the range of the attribute into subranges, i.e., a test is devised that quantizes the range. In this paper, we focus only on the problem of deciding which values of a discrete-valued (or discretized) attribute should be branched on, and which should not. We propose that by avoiding branching on all values (as in ID3), better trees are obtained.

We originally developed the GID3 algorithm [2] to address this problem. GID3 is dependent on a user-determined parameter setting (TL) that controls its tendency towards branching on some versus all values of an attribute. We have demonstrated that for certain settings of TL, GID3 produces significantly better trees than ID3 or C4. In this paper we present the GID3* algorithm, in which the dependence on a user-specified parameter has been removed. We empirically demonstrate that GID3* produces better trees than GID3 for a wide range of parameter settings.

The Attribute Selection Criterion

Assume we are to select an attribute for branching at a node having a set S of N examples from a set of k classes {C_1, ..., C_k}. Assume that some test T on attribute A partitions the set S into the subsets S_1, ..., S_r. Let P(C_i, S) be the proportion of examples in S that have class C_i. The class entropy of a subset S is defined as:

Ent(S) = -\sum_{i=1}^{k} P(C_i, S) \log(P(C_i, S))
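To make the criterion concrete, the following is a minimal Python sketch, not taken from the paper: the function names entropy and partition_entropy are illustrative, and the log base is assumed to be 2. It computes the class entropy Ent(S) and the weighted entropy of the partition induced by a test that branches on each value in a chosen subset individually while grouping all remaining values under a single DEFAULT branch, in the style described above.

import math
from collections import Counter, defaultdict

def entropy(labels):
    # Ent(S) = -sum_i P(C_i, S) * log(P(C_i, S)), computed over the
    # class labels of the examples in S (base-2 logarithm assumed).
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def partition_entropy(values, labels, branch_values):
    # Weighted entropy of the partition S_1, ..., S_r induced by
    # branching on each value in branch_values individually and
    # grouping every remaining value under one DEFAULT branch.
    n = len(labels)
    buckets = defaultdict(list)
    for v, y in zip(values, labels):
        buckets[v if v in branch_values else "DEFAULT"].append(y)
    return sum(len(ys) / n * entropy(ys) for ys in buckets.values())

# Hypothetical toy data: attribute values and class labels.
vals = ["a", "a", "b", "b", "c", "c"]
lbls = ["+", "+", "-", "-", "+", "-"]
print(partition_entropy(vals, lbls, branch_values={"a", "b"}))

Branching on every observed value, as ID3 does, is the special case where branch_values contains all values; GID3* instead evaluates tests that branch on proper subsets plus a DEFAULT branch. The criterion GID3* uses to search over and select among these candidate subsets is developed in the paper and is not reproduced in this sketch.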
Similar Resources
Application of Different Methods of Decision Tree Algorithm for Mapping Rangeland Using Satellite Imagery (Case Study: Doviraj Catchment in Ilam Province)
Using satellite imagery to study the Earth's resources has attracted many researchers. Different phenomena have different spectral responses to electromagnetic radiation, and one major application of satellite data is the classification of land cover. In recent years, a number of classification algorithms have been developed for the classification of remote sensing data. One of the most nota...
A Comparative Study on Decision Rule Induction for incomplete data using Rough Set and Random Tree Approaches
Handling missing attribute values is one of the greatest challenges in data analysis, and many approaches can be adopted to handle missing attributes. In this paper, a comparative analysis of an incomplete dataset for future prediction is made using the rough set approach and random tree generation in data mining. The result of the simple classification technique (using random tree ...
Decision tree induction is a popular method for mining knowledge from data by means of decision tree building
In decision analysis, decision trees are commonly used as a visual support tool for identifying the best strategy that is most likely to reach a desired goal. A decision tree is a hierarchical structure normally represented as a tree-like graph model. The tree consists of decision nodes, splitting paths based on the values of a decision node, and sink nodes representing final decisions. In data...
Handling Missing Attribute Values
In this chapter, methods of handling missing attribute values in data mining are described. These methods are categorized into sequential and parallel. In sequential methods, missing attribute values are first replaced by known values as a preprocessing step, and the knowledge is then acquired from a data set with all attribute values known. In parallel methods, there is no preprocessing, i.e., knowledge...
A Counter Example to the Stronger Version of the Binary Tree Hypothesis
The paper describes a counterexample to the hypothesis which states that a greedy decision tree generation algorithm that constructs binary decision trees and branches on a single attribute-value pair, rather than on all values of the selected attribute, will always lead to a tree with fewer leaves for any given training set. We also show that RELIEFF is less myopic than other impurity functions...